Skip to content

fix(ingest/snowflake): resolve external stage lineage via DataHub graph#17358

Open
alokr-dhub wants to merge 7 commits into
masterfrom
fix/snowflake_stages_external_lineage
Open

fix(ingest/snowflake): resolve external stage lineage via DataHub graph#17358
alokr-dhub wants to merge 7 commits into
masterfrom
fix/snowflake_stages_external_lineage

Conversation

@alokr-dhub
Copy link
Copy Markdown
Contributor

@alokr-dhub alokr-dhub commented May 8, 2026

Problem

The Snowflake connector previously resolved external stage lineage by generating a single dataset URN directly from the raw stage URL (e.g. s3://my-bucket/data/ → one S3 URN). This produces incorrect lineage because:

  1. A stage typically points to a path prefix, not a single dataset — the actual datasets ingested by the S3/GCS/ABS source are individual folders/tables under that prefix.
  2. The generated URN will never match URNs produced by the data lake source when that source was ingested with a platform_instance, breaking the lineage graph.

Solution

Graph-based stage resolution (opt-in)

When resolve_external_stage_lineage_via_graph is enabled, the connector looks up every existing dataset URN in DataHub whose path is rooted at the stage's URL path. A single stage can therefore fan out to multiple upstream datasets that the corresponding data lake source previously ingested.

source:
  type: snowflake
  config:
    # ...
    resolve_external_stage_lineage_via_graph: true
    # Optional: set if the S3/GCS/ABS source was ingested with a platform_instance
    external_stage_platform_instance: "prod"

Changes

  • New utility data_lake_common/path_urn_resolver.pyDataLakePathResolver bulk-fetches every dataset URN under a bucket once per (platform, platform_instance, bucket) and matches client-side. N stage paths sharing a bucket cost a single graph call, not N. Designed to be reusable across connectors (Redshift UNLOAD, external tables, etc.) and intentionally report-free; transient failures are surfaced via DataLakeUrnLookup.transient_error for callers to log and count.
  • SnowflakeStagesExtractor — added _parse_external_stage_url, _compute_external_stage_urns, and _resolve_stage_via_graph. URL resolutions are cached per stage URL within a run; transient graph failures are not cached so a later stage with the same URL can retry.
  • StageLookupEntry.dataset_urn: Optional[str]dataset_urns: Tuple[str, ...] — stages can now resolve to multiple upstream datasets. The dataclass is frozen and constructed once with its final URNs.
  • ParsedStageUrl — frozen dataclass; bucket is derived from path via a @property so the invariant holds by construction.
  • snowflake_pipes.py — pipe extractor fans out every resolved stage URN into inputDatasets and deduplicates across stages using dict.fromkeys (insertion-order preserving, O(n)).
  • snowflake_config.py — two new fields: resolve_external_stage_lineage_via_graph (default False) and external_stage_platform_instance (default None). Validators normalize blank platform instances to None and warn when external_stage_platform_instance is set without the resolve flag.
  • snowflake_report.py — added external_stage_lineage_resolved and external_stage_lineage_unresolved counters.
  • snowflake_v2.py — passes ctx.graph to SnowflakeStagesExtractor.
  • Robustness_compute_external_stage_urns now also catches InvalidUrnError so a malformed gcs:// (empty path is rejected by the URN constructor) no longer crashes extraction; the existing structured warning is emitted instead.
  • Integration test fixtures — fixed naive datetime objects to use tzinfo=timezone.utc so containerized CI in non-UTC zones is deterministic.

False-positive prevention

A START_WITH URN filter for s3://my-bucket/folder would incorrectly match s3://my-bucket/folder_other. The dataset_path_is_rooted_at helper (in path_urn_resolver.py) rejects such false positives by requiring the character immediately after the prefix to be / (child path) or , (URN env separator, i.e. exact match). The same check also drops sibling buckets that the case-insensitive START_WITH wildcard might over-match.

Behaviour without the flag

When resolve_external_stage_lineage_via_graph is False (default), the connector falls back to the existing path-based URN generation — no change in behaviour for existing users.

Test plan

  • tests/unit/data_lake_common/test_path_urn_resolver.py — covers the bucket-scoped fetch contract (one fetch per bucket, separate buckets fetched separately, scope is the bucket not the full path), sibling-bucket exclusion, transient-error pass-through with no caching, and the dataset_path_is_rooted_at boundary check parametrized across child/exact/sibling/wrong-platform/malformed inputs.
  • TestGraphResolvedExternalStageLineage — S3/GCS/ABS resolution, empty matches, disabled-flag fallback, graph errors (with and without cache poisoning), platform-instance forwarding, URL caching, unsupported schemes, empty paths (both s3:// and gcs://), internal stages, and malformed URLs.
  • TestExternalStageConfigValidators — blank/whitespace normalization, stripping, and the cross-field warning.
  • TestSnowflakePipesExtractor — fan-out across multiple resolved URNs, cross-stage deduplication, and the stages_without_urn warning path (resolved/empty mix and the all-empty case).
  • ./gradlew :metadata-ingestion:lint — ruff + mypy clean.
  • ./gradlew :metadata-ingestion:testQuick — 546 Snowflake unit tests pass.

🤖 Generated with Claude Code

@github-actions github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label May 8, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 97.01493% with 4 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...hub/ingestion/source/snowflake/snowflake_stages.py 97.14% 2 Missing ⚠️
...stion/source/data_lake_common/path_urn_resolver.py 97.61% 1 Missing ⚠️
...hub/ingestion/source/snowflake/snowflake_config.py 92.85% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@datahub-connector-tests
Copy link
Copy Markdown

datahub-connector-tests Bot commented May 8, 2026

Connector Tests Results

Connector tests failed for commit 2b87fae

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

@alokr-dhub alokr-dhub changed the title fix(ingest/snowflake): snowpipes upstream lineage handling fix(ingest/snowflake): resolve external stage lineage via DataHub graph May 8, 2026
@alokr-dhub alokr-dhub marked this pull request as ready for review May 8, 2026 15:36
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Linear: ING-2599

@maggiehays maggiehays added the needs-review Label for PRs that need review from a maintainer. label May 8, 2026
@maggiehays maggiehays added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels May 11, 2026
…raph and external_stage_platform_instance fields

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@alwaysmeticulous
Copy link
Copy Markdown

alwaysmeticulous Bot commented May 15, 2026

🔴 Meticulous spotted visual differences in 116 of 1330 screens tested: view and approve differences detected.

Meticulous evaluated ~10 hours of user flows against your PR.

Last updated for commit e0388aa refactor(ingest/snowflake): tighten external-stage lineage types and add.... This comment will update as new commits are pushed.

@maggiehays maggiehays added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels May 15, 2026
…tform_instance field

The merge of the marketplace feature (#14304) into this branch dropped the
closing ) for the external_stage_platform_instance Field() definition,
producing a SyntaxError on import.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 2, 2026

Your PR has been assigned to @sgomezvillamor (sergio.gomez) for review (ING-2599).

…add path resolver utility

- Extract DataLakePathResolver to data_lake_common/: a reusable bucket-scoped
  path -> existing-URN graph-lookup utility with per-run caching. One graph
  call per bucket regardless of how many stage paths share it.
- Catch InvalidUrnError alongside ValueError when parsing stage URLs so a
  malformed gcs:// (empty path) no longer crashes extraction.
- Make ParsedStageUrl, StageLookupEntry, and DataLakeUrnLookup frozen
  dataclasses with Tuple[str, ...] URN lists; construct StageLookupEntry once
  with its final URNs. Bucket is derived from path via @Property so the
  invariant holds by construction.
- Replace lazy _get_path_resolver/assert pattern with __post_init__
  construction; raise RuntimeError instead of asserting so the guard stays
  loud under python -O.
- Replace O(n^2) input_datasets dedup with dict.fromkeys (preserves order).
- Add tests for stages_without_urn warning path, gcs empty-path parse, and
  the DataLakePathResolver bucket-boundary contract.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata needs-review Label for PRs that need review from a maintainer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants